In this blog post, we are going to talk about a few tools for interpretable machine learning. The importance of the subject is fairly clear: interpretability helps us catch and mitigate bias before we deploy a model, and it also helps us remain relevant should autoML take off in the next 10 years. We will go over the tools in order of (in my opinion) increasing utility, starting with simple permutation importance, then PDPs and ICE, and finally ALE.

Permutation Importance: Math and Intuition

Everyone loves tree-based models. Gradient boosting, random forests, and friends are wonderful, flexible tools. Another benefit of these models, because of their tree-ness, is that we can actually see how “important” each variable is in the decisions the model makes. This is also one of the many reasons we love linear models: we can see and quantify the strength of each feature in our model. But why limit ourselves to just linear and tree-based models?

Let’s try to think of a new approach to variable importance. When I was first really getting into ML, I remember asking one of my professors: “How much time do you spend on feature engineering?”. I will never forget his answer: “Feature engineering [is] the most crucial part to improve both accuracy and model generalization. If you have an unnecessary feature in the model, you are in essence fitting noise.”. This has stuck with me for a long time, and it is a useful thing to keep in mind while discussing permutation importance.

If unnecessary features just add noise, which decreases model accuracy and generalization, what happens if we replace a good feature with noise? Our model should be inherently worse, no?

If I replace a feature with noise, how much worse does the model perform?

This is the key idea of permutation-based variable importance. Calculating it takes just four simple steps:

  1. Calculate prediction loss

  2. Replace a feature with noise

  3. Recalculate prediction loss

  4. Compare
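Before we formalize anything, note that the whole recipe fits in a few lines of code. Here is a minimal sketch on synthetic data (the dataset and model here are stand-ins, not the cruise data we use later):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=200, n_features=4, noise=1.0, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# 1. prediction loss with the original features
baseline = mean_squared_error(y, model.predict(X))

# 2.-4. replace one feature at a time with a permutation of itself,
# recompute the loss, and compare to the baseline
rng = np.random.default_rng(0)
importance = {}
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importance[j] = mean_squared_error(y, model.predict(X_perm)) / baseline
```

A ratio near 1 means the feature barely mattered; a large ratio means the model leans on it heavily.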

Formalization

Let’s now formulate permutation importance mathematically! First, let’s define our data as the set \(x\), with \(m\) observations and \(n\) features. Next, consider two subsets of \(x\): \(x_s\) and \(x_c\). \(x_s\) represents the feature(s) we are interested in, and \(x_c\) represents the complement of \(x_s\) (in English: everything else). Thus:

\[x = (x_s, x_c)\]

Let’s first define the original loss, computed with the original features, as \(\mathcal{L}\), where \(\hat f\) is our fitted model and \(y\) the true response:

\[ \mathcal{L} = \mathrm{loss}\left(y, \hat f \left(x_s, x_c \right) \right) \]

Next, we need to replace \(x_s\) with noise. To do that, we want to sample the marginal distribution of \(x_s\); that is, the distribution of \(x_s\) independent of the other features. With a reasonably sized dataset, we can simply permute \(x_s\) for more or less the same result. We will denote the permutation of \(x_s\) as \(x_s^*\). Next, let’s define the loss of the permuted feature, \(\mathcal{L}^*\): \[ \mathcal{L}^* = \mathrm{loss}\left(y, \hat f \left(x_s^*, x_c \right) \right) \]
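A quick sanity check on why a permutation is a fair stand-in for sampling the marginal: shuffling a column reuses exactly the observed values, so every marginal statistic is untouched and only the pairing with the other features is destroyed. For example:

```python
import numpy as np

rng = np.random.default_rng(42)
x_s = rng.normal(loc=5.0, scale=2.0, size=1000)

# a permutation of x_s: same values, different order
x_s_perm = rng.permutation(x_s)

# marginal statistics are exactly preserved
assert np.array_equal(np.sort(x_s), np.sort(x_s_perm))
assert np.isclose(x_s.mean(), x_s_perm.mean())
```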

Finally, we can calculate the variable importance of \(x_s\):

\[ \mathrm{VIP}_{\mathrm{perm}}(x_s) = \frac{\mathcal{L}^*}{\mathcal{L}} \]

There we go! It’s that simple! An addendum suggested by Jeremy Howard of fast.ai: add a feature of pure noise and see how important it is, as a reference point.

Code Implementation

For our data, we will use the dataset discussed in Benjamin Tayo’s excellent blog post. We use this dataset because it poses a real challenge: its features are highly correlated, which will expose some of the pitfalls of our techniques.

The goal with this dataset is to predict the number of crew members that will be on a cruise ship, given some parameters describing the ship. I believe the independent variables are fairly self explanatory. First, let’s read the data into python and do a train test split:

import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston, fetch_california_housing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error as loss_mse
import math
import statistics as stats
import matplotlib.cm as cm
from pprint import pprint

# so this is readable:
import warnings
from sklearn.exceptions import DataConversionWarning, ConvergenceWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=ConvergenceWarning)
cruise = pd.read_csv("https://github.com/bot13956/ML_Model_for_Predicting_Ships_Crew_Size/raw/master/cruise_ship_info.csv")


X = cruise.loc[:, cruise.columns != "crew"]
X = X.loc[:, X.columns != "Ship_name"]
X = X.loc[:, X.columns != "Cruise_line"]
y = cruise.loc[:, cruise.columns == "crew"]


def split(df, p_train = 0.75, random_state = 0):
    train = df.sample(frac = p_train, random_state = random_state)
    test = df.drop(train.index)
    return(train, test)

(X_train, X_test), (y_train, y_test) = (split(x) for x in [X, y])

Next, let’s use the amazing reticulate package to pass these exact data frames into R:

X <- py$X
y <- py$y
X_train <- py$X_train
X_test <- py$X_test
y_train <- py$y_train
y_test <- py$y_test
X

Now we are all set up to implement permutation importance in both python and R.




(Click tabs below to change language!)

In Python

First, let’s set up three models to test: a linear model, knn, and a random forest:

lm = LinearRegression()
knn = KNeighborsRegressor(13) 
rf = RandomForestRegressor(n_estimators = 100)

models = [lm, knn, rf]

for m in models:
  m.fit(X_train, y_train)
#> LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
#> KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
#>                     metric_params=None, n_jobs=None, n_neighbors=13, p=2,
#>                     weights='uniform')
#> RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
#>                       max_features='auto', max_leaf_nodes=None,
#>                       min_impurity_decrease=0.0, min_impurity_split=None,
#>                       min_samples_leaf=1, min_samples_split=2,
#>                       min_weight_fraction_leaf=0.0, n_estimators=100,
#>                       n_jobs=None, oob_score=False, random_state=None,
#>                       verbose=0, warm_start=False)

As a baseline truth, let’s check which features are important in the random forest, and the strength of the predictors in our linear regression:

pprint(dict(zip(X_train.columns, rf.feature_importances_)))
#> {'Age': 0.019861556573135212,
#>  'Tonnage': 0.39438887203208794,
#>  'cabins': 0.42837889053865646,
#>  'length': 0.08365319867688496,
#>  'passenger_density': 0.016466060536403284,
#>  'passengers': 0.0572514216428321}
pprint({X_train.columns[i]: lm.coef_[:,i] for i in range(lm.coef_.shape[1])})
#> {'Age': array([-0.01321228]),
#>  'Tonnage': array([0.00316145]),
#>  'cabins': array([0.79858261]),
#>  'length': array([0.39111394]),
#>  'passenger_density': array([0.01037499]),
#>  'passengers': array([-0.1045376])}

Looks like tonnage and cabins are the most important variables for the random forest, and cabins, length, and passengers for linear regression.

Next, let’s create a function for permutation importance! This should be pretty easy to do; I have added a few extra components for thoroughness, but the code is not hard:

from sklearn.base import clone

def permutation_importance(model, x, y, loss, base = False, x_train = None, y_train = None, kind = "prop", n_rounds = 5):
    explan = x.columns
    baseline = loss(y, model.predict(x))
    res = {k: [] for k in explan}
    if base:
        res["baseline"] = []
    for n in range(n_rounds):
        for col in explan:
            x_temp = x.copy()
            x_temp[col] = np.random.permutation(x_temp[col])
            if kind != "prop":
                res[col].append(loss(y, model.predict(x_temp)) - baseline)
            else:
                res[col].append(loss(y, model.predict(x_temp)) / baseline)
        if base:
            # refit a clone of the model (same hyperparameters) with an
            # extra pure-noise column, to use as a reference importance
            x_temp = x.copy()
            x_train2 = x_train.copy()
            x_temp["baseline"] = np.clip(np.random.normal(size = len(x_temp)), -1., 1.)
            x_train2["baseline"] = np.clip(np.random.normal(size = len(x_train2)), -1., 1.)
            mod2 = clone(model)
            mod2.fit(x_train2, y_train)
            if kind != "prop":
                res["baseline"].append(loss(y, mod2.predict(x_temp)) - baseline)
            else:
                res["baseline"].append(loss(y, mod2.predict(x_temp)) / baseline)
    return(pd.DataFrame.from_dict(res))

Let’s also define a helper function to run this for all our models, and then go ahead and calculate the importances!

# convert object name to string!
def get_name(obj):
    name =[x for x in globals() if globals()[x] is obj][0]
    return(name)

imps = {}
for m in models:
    imps[get_name(m)] = permutation_importance(m, X_test, y_test, loss_mse, True,  X_train, y_train, n_rounds = 5)
pprint(imps)
#> {'knn':         Age    Tonnage  passengers    length    cabins  passenger_density  baseline
#> 0  0.917927  17.123217    1.215490  1.005181  0.949343           1.203542  1.311947
#> 1  0.967433  19.350106    1.050657  1.004731  0.979318           1.277555  1.310816
#> 2  0.910865  20.522515    1.181720  1.005119  1.036481           1.290916  1.448880
#> 3  0.996739  22.509468    1.316968  1.005519  0.984456           1.300728  1.313596
#> 4  1.016965  17.550528    1.059935  1.000513  0.982986           1.145060  1.164093,
#>  'lm':         Age   Tonnage  passengers    length     cabins  passenger_density  baseline
#> 0  0.955348  0.974176    4.811749  3.846281  54.143517           0.987116  1.002767
#> 1  0.998085  1.051750    7.126415  3.179625  57.173887           1.082029  1.073530
#> 2  0.988191  0.988083    8.226791  4.240978  66.029099           1.219061  0.994784
#> 3  0.883125  1.109912    5.977181  6.174594  66.395439           1.187754  1.005922
#> 4  1.074860  1.092413    6.208312  5.364766  81.948201           0.992672  1.024712,
#>  'rf':         Age    Tonnage  passengers    length     cabins  passenger_density  baseline
#> 0  1.291075  13.474333    1.603813  1.810496  14.392110           1.520825  1.397843
#> 1  1.259034  16.125429    1.829040  2.634560  17.022706           1.182123  1.972934
#> 2  1.237481  15.453072    1.818316  2.658935  16.303640           1.447842  1.447823
#> 3  1.036346  14.187796    1.593653  2.042860  16.737690           1.340698  1.867108
#> 4  1.258006  20.013529    1.937349  2.117840  17.493989           1.278115  2.240978}

This output is a bit hard to read, so let’s go ahead and write a helper function that averages the importances and plots them nicely!

plt.style.use("ggplot")
def plot_perm_imp(df, ax = None, color = 'blue'):
    # mean the columns and put it back into a data frame
    df1 = (df.apply(stats.mean, 0, result_type = "broadcast")).drop(df.index[1:])
    # create a new data frame excluding baseline, so we can do something special with it
    df_temp = df1.loc[:, df.columns != 'baseline']
    # melt and sort it for plotting
    df2 = df_temp.melt(var_name = 'variable', value_name = 'importance')
    df2 = df2.sort_values(by = "importance")
    # plot it nicely
    df2.plot(kind = 'barh', x = 'variable', y = 'importance', width = 0.8, ax = ax, color = color)
    # draw a bar and an arrow to baseline
    for n in df1.columns:
        if n == "baseline":
            plt.axvline(x = df1[n][0])
            plt.annotate('baseline',
                         xy = (df1[n][0], 1),
                         xytext = (df1[n][0] + 0.4, 3),
                         arrowprops = dict(facecolor = 'black',
                                           shrink = 0.05),
                         bbox = dict(boxstyle = "square", fc = (1,1,1)))


# plot the importance dict!
fig = plt.figure(figsize=(10,10))
for i in range(len(imps.keys())):
    ax = fig.add_subplot(len(list(imps.keys())),1, i+1)
    c = sns.color_palette("hls", i+1)[i]
    plot_perm_imp(imps[list(imps.keys())[i]], ax = ax, color = c)
    ax.set_title(list(imps.keys())[i])
    for tick in ax.yaxis.get_major_ticks():
      tick.label.set_fontsize("x-small")
      tick.label.set_rotation(45)
plt.show()
Permutation Importances for linear regression, knn, and a random forest

Looks like it worked alright! For the random forest especially, we nicely identified the two most important features, though with little granularity. With our linear model it is a little less clear, but the same three most important features carry through! Still, we get no real sense of the scale or shape of each variable’s effect. We also run a general risk with permutation importance and correlated variables: the permuted data often contains unrealistic observations. For example, our permutations may produce boats that are tiny (in length) yet weigh as much as the largest boats; they would sink!! This leads to often unreliable output with many permutation-based tools, which we will also explore in this post.
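We can make the “unrealistic observations” problem concrete with a tiny simulation (hypothetical numbers, not our cruise data): permute one of two strongly correlated features and the correlation collapses, so the model is scored on length/tonnage combinations that never occur in reality.

```python
import numpy as np

rng = np.random.default_rng(1)
length = rng.normal(200, 30, size=1000)               # hypothetical ship lengths
tonnage = 0.5 * length + rng.normal(0, 5, size=1000)  # strongly tied to length

corr_before = np.corrcoef(length, tonnage)[0, 1]      # close to 1

# permuting tonnage destroys the pairing: now tiny ships carry huge tonnage
tonnage_perm = rng.permutation(tonnage)
corr_after = np.corrcoef(length, tonnage_perm)[0, 1]  # close to 0
```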

In R

First, let’s set up three models to test: a linear model, a knn model, and a random forest:

library(randomForest)
library(kknn)

# set up models with parameters
rf <- function(df) {
  return(randomForest(crew ~ ., data = df, ntree = 100))
}

# knn in R is annoying, so we will need to build a consistent api ourselves:

knn <- function(df) {
  res <- list()
  res$train <- df
  res$k <- 13
  return(structure(res, class = "knn"))
}

predict.knn <- function(obj, newdata) {
  out <- kknn(crew ~ ., train = obj$train, test = newdata, k = obj$k)
  return(as.numeric(out$fitted.values))
}

linear_model <- function(df) {
  lm(crew ~ ., data = df)
}

models <- list("lm" = linear_model,"knn"= knn,"rf"= rf)

t_train <- cbind(X_train, y_train)

trained_models <- lapply(models, function(f) f(t_train))

As a baseline truth, let’s check which features are important in the random forest, and the strength of the predictors in our linear regression:

pander::pander(summary(trained_models[["lm"]]))
                   Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)        -0.9349   1.878       -0.4977  0.6197
Age                -0.01321  0.02083     -0.6343  0.5272
Tonnage            0.003161  0.01692     0.1868   0.8522
passengers         -0.1045   0.06831     -1.53    0.1288
length             0.3911    0.1624      2.409    0.01765
cabins             0.7986    0.1058      7.545    1.34e-11
passenger_density  0.01037   0.02618     0.3963   0.6927

Fitting linear model: crew ~ .

Observations: 118    Residual Std. Error: 1.089    \(R^2\): 0.906    Adjusted \(R^2\): 0.9009
pander::pander(importance(trained_models$rf))
                   IncNodePurity
Age                64.52
Tonnage            326.6
passengers         286
length             275.6
cabins             351.1
passenger_density  35.25

Looks like tonnage and cabins are the most important variables for the random forest, and cabins, length, and passengers for linear regression.

Next, let’s create a function for permutation importance! This should be pretty easy to do; I have added a few extra components for thoroughness, but the code is not hard:

# loss function
loss_mse <- function(truth, preds) {
  error <- truth - preds
  square_error <- error^2
  return(mean(square_error))
}

# get baseline loss
get_loss <- function(model, x, y, loss) {
  loss(y, predict(model, x))
}


# permute a single column
permute_column <- function(df, col) {
  df[[col]] <- df[[col]][sample(1:nrow(df))]
  return(df)
}


permutation_importance <- function(model, x, y, loss, x_train = NA, y_train = NA, n_rounds = 5) {
  baseline <- get_loss(model, x, y[[1]], loss)
  explan <- names(x)
  single_round_imp <- function() {
    dfs <- lapply(explan, function(i) permute_column(x, i))
    return(vapply(dfs, function(i) get_loss(model, i, y[[1]], loss), numeric(1)))
  }
  res <- lapply(1:n_rounds, function(i) single_round_imp())
  res <- as.data.frame(do.call(rbind, res))
  res <- as.data.frame(lapply(res, function(v) v / baseline))
  names(res) <- names(x)
  return(as.data.frame(lapply(res, mean)))
}

# we can do this in a second because R is speedy 
perm_vips <- lapply(trained_models, function(x) permutation_importance(x, X_test, y_test, loss_mse, n_rounds = 30))
pander::pander(perm_vips)
  • lm:

    Age    Tonnage  passengers  length  cabins  passenger_density
    1.054  1.065    7.457       4.243   70.4    1.079

  • knn:

    Age    Tonnage  passengers  length  cabins  passenger_density
    1.259  1.809    2.105       2.951   2.685   1.267

  • rf:

    Age    Tonnage  passengers  length  cabins  passenger_density
    1.159  6.526    5.099       4.77    8.388   1.533

So, it looks like our importances line up quite well with the importances the models calculate explicitly; not a bad job! Let’s go ahead and produce a nice plot!

library(tidyverse)
vip_flat <- lapply(perm_vips, gather)

# Finally an appropriate use of superassignment!!
lapply(names(vip_flat), function(x) {
  vip_flat[[x]][["model"]] <<- x
})  %>% invisible

do.call( rbind, vip_flat) %>% as.data.frame %>%
  ggplot(aes(x = key, y = value, fill = model)) + geom_bar(stat = "identity") +
  facet_grid(model ~ ., scales = "free") + coord_flip() + ggthemes::theme_fivethirtyeight() + 
  ggthemes::scale_fill_fivethirtyeight() + ggtitle("Permutation Importance")

Looks like it worked alright! For the random forest especially, we nicely identified the two most important features, though with little granularity. With our linear model it is a little less clear, but the same three most important features carry through! Still, we get no real sense of the scale or shape of each variable’s effect. We also run a general risk with permutation importance and correlated variables: the permuted data often contains unrealistic observations. For example, our permutations may produce boats that are tiny (in length) yet weigh as much as the largest boats; they would sink!! This leads to often unreliable output with many permutation-based tools, which we will also explore in this post.

Partial Dependence

Let’s try to think of another way to quantify the effect of a variable. We previously answered the question: “How much worse will my prediction be, given a feature has been replaced with noise?”. Now, let’s ask a new question:

What will my prediction be, in general, when a feature is at a specific value?

This is Partial Dependence in a nutshell. Stemming from this, we can make the claim:

If a feature is more important, the prediction will vary more with that feature than others.

This is exactly the claim made (and supported) in this excellent paper. Thus in this section, we will calculate two things: Partial Dependence, and Partial Dependence Importance. Partial Dependence, at a high level, is calculated like this:

  1. Take your original data, copy it
  2. Replace the feature of interest with the value of the first observation of the feature of interest.
  3. Calculate new prediction vector
  4. Average that
  5. Repeat for all observations

And then we can take, more or less, the standard deviation of these curves to estimate importance (it’s a bit different for categorical variables; see the paper).

Formalization

Consider again \(x = (x_s, x_c)\)

We can define the partial dependence as the expected value of our prediction function, \(\hat f\), taken over the marginal distribution of \(x_c\):

\[\mathrm{PDP}(x_s) = E_{x_c} \left[ \hat f (x_s, x_c) \right]\]

\[ = \int \hat f (x_s, x_c)\, \mathbb{P}_c(x_c)\, dx_c\] Where \(\mathbb{P}_c\) is the marginal distribution of \(x_c\), \(\int p(x)\, dx_s\).

We can then estimate this for a finite sample of \(n\) observations using the Monte Carlo method:

\[\widetilde {\mathrm{PDP}} (x_j) = \frac{1}{n} \sum_{i=1}^{n} \hat f (x_j, x_{\backslash j}^{(i)})\]

Note the change in notation as we move from the continuous world of integrals to the finite world of sums: \(x_j\) represents the feature(s) we are interested in, and \(x_{\backslash j}\) is set notation for “all the other features”; that is, \(x_{\backslash j}\) is the complement of \(x_j\). \(x_{\backslash j}^{(i)}\) represents a single observation of \(x_{\backslash j}\). In other words, we hold the value of \(x_j\) constant and average the prediction as \(x_{\backslash j}\) varies.
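As a sanity check on the notation, the Monte Carlo estimator translates almost line for line into numpy. Here `f_hat` is a stand-in prediction function, not one of our fitted models:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

def f_hat(X):
    # a stand-in prediction function: linear in feature 0, quadratic in feature 1
    return 2 * X[:, 0] + X[:, 1] ** 2

def pdp_estimate(X, j, grid):
    # PDP-tilde(x_j): fix feature j at each grid value, average the predictions
    out = []
    for v in grid:
        X_tmp = X.copy()
        X_tmp[:, j] = v
        out.append(f_hat(X_tmp).mean())
    return np.array(out)

grid = np.linspace(-2, 2, 9)
curve = pdp_estimate(X, 0, grid)
# for this f_hat, the PDP in feature 0 is the line 2*x_0 + E[x_1^2] (slope 2)
```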

Code Implementation

In Python

First, let’s define a function to calculate the partial dependence of a prediction on a single variable:

def pdp_var(model, x, var):
    explan = sorted(x[var])
    preds = []
    for val in explan:
        X_tmp = x.copy()
        # hold the feature of interest fixed at this grid value
        X_tmp[var] = val
        preds.append(model.predict(X_tmp))
    # rows correspond to grid values, columns to observations
    preds = np.asarray(preds).reshape(len(explan), len(x))
    # average over observations to get the partial dependence at each grid value
    pv = preds.mean(axis = 1)
    return(explan, pv)

Next, let’s scale that up to work over an entire data frame!

def pdp_df(model, x):
    # return a dict of data frames
    res = {}
    for c in x.columns:
        d = pd.DataFrame()
        d["Value"], d["Average Prediction"] = pdp_var(model, x, c)
        res[c] = d
    return(res)

Let’s now show the partial dependence for just our random forest!


rf_pdp = pdp_df(rf, X_test)

plt.style.use("seaborn-whitegrid")
fig = plt.figure(figsize = (10, 10))
for k in range(0,len(rf_pdp.keys())):
    ax = fig.add_subplot(2,3, k+1)
    #c = cm.Paired(i/len(imps.keys()), 1)
    c = sns.color_palette("hls", k+1)[k]
    df = rf_pdp[list(rf_pdp.keys())[k]]
    sns.lineplot(x = "Value", y = "Average Prediction", ax = ax, color = c, data = df)
    ax.set_title(list(rf_pdp.keys())[k])
plt.subplots_adjust(wspace = 0.5, hspace = 0.5)
plt.show()

Remember, we said that the variability in partial dependence corresponds to the importance of a variable. Let’s go ahead and write that up too!

def pdp_importance(model, x):
    pdpdf = pdp_df(model, x)
    v = {k:np.std(pdpdf[k]["Average Prediction"]) for k in pdpdf.keys()}
    return(v)


pdp_imps = {get_name(m):pdp_importance(m, X_test) for m in models}

def plot_pdp_imp(d, ax = None, color = "blue"):
    df = pd.DataFrame()
    df["Variable"] = d.keys()
    df["Importance"] = d.values()
    df.sort_values("Importance").plot(kind = "barh", x = "Variable", y = "Importance", width = 0.8, ax = ax, color = color)

fig = plt.figure(figsize = (10,10))
for i in range(len(pdp_imps.keys())):
    ax = fig.add_subplot(len(list(pdp_imps.keys())),1, i+1)
    #c = cm.Paired(i/len(imps.keys()), 1)
    c = sns.color_palette("hls", i+1)[i]
    plot_pdp_imp(pdp_imps[list(pdp_imps.keys())[i]], ax = ax, color = c)
    ax.set_title(list(pdp_imps.keys())[i])
plt.show()

In R

First, let’s define a function to calculate the partial dependence of a prediction on a single variable:

pdp_var <- function(model, x, var){
  x_sorted <- x[order(x[[var]], decreasing = TRUE),]
  # predefine matrix: row i holds predictions for every observation
  # with the feature of interest fixed at the i-th sorted value
  out <- matrix(nrow = nrow(x), ncol = nrow(x))
  for (i in 1:nrow(x)) {
    tmp_df <- x_sorted
    tmp_df[[var]] <- tmp_df[[var]][i]
    out[i,] <- predict(model, tmp_df)
  }
  # average across observations (columns) for each grid value (row)
  res <- rowMeans(out)
  return(data.frame("value" = x_sorted[[var]], "avg_pred" = res))
}

Next, let’s scale that up over the entire data frame!

pdp_df <- function(model, x) {
  pdps <- lapply(colnames(x), function(n) {
    res <- pdp_var(model, x, n)
    res$variable <- n
    return(res)
  })
  return(pdps)
}

(rf_pdp <- do.call(rbind, pdp_df(trained_models[[3]], X_test)))

Finally, let’s plot these pdps!

library(ggthemes)
rf_pdp %>% ggplot(aes(color = variable, y =  avg_pred, x = value)) + 
  geom_line(size = 1.5) +
  facet_wrap(variable ~ ., scales = "free") +
  theme_fivethirtyeight() + scale_color_hc()

Remember, we said that the variability in partial dependence corresponds to the importance of a variable. Let’s go ahead and write that up too! We are going to use a lot of lapply, since we are dealing with a list of lists of data frames and trying to quickly get it into a tidy format for ggplot.

# return the pdp lists for all trained models
pdps <- lapply(trained_models, function(m) pdp_df(m, X_test))

# calculate the standard deviation of the average predictions
sd_pdp <- function(df) {
  imp <- as.numeric(sd(df$avg_pred))
  return(data.frame("importance" = imp, "variable" = df$variable[1]))
}

# we have a list of lists of data frames, so things are going to get a little weird
# We need to apply our function at depth of two, hence the double lapply
pdp_imps <- lapply(pdps, function(x) lapply(x, sd_pdp))

# next we need to clean up all the items of our lists are tidy data frames/matrices
pdp_imps <- lapply(pdp_imps, function(x) do.call(rbind, x))

# finally before we can plot, we add a character vector indicating model type
pdp_imps <- lapply(names(pdp_imps), function(x) {
  pdp_imps[[x]]$model <- x
  return(pdp_imps[[x]])
  }
)

# then we turn it all into a nice data frame
pdp_imps <- do.call(rbind, pdp_imps)

pdp_imps %>% ggplot(aes(x = variable, y = importance, fill = model)) + geom_bar(stat = "identity") +
  facet_grid(model ~ ., scales = "free") + coord_flip() + ggthemes::theme_fivethirtyeight() + 
  ggthemes::scale_fill_fivethirtyeight() + ggtitle("Partial Dependence Importance")

Discussion

Something odd is happening here: our partial dependence importances are way off (and if you look at the partial dependence graphs, something is off there too)! If we focus on just the random forest plots, we immediately see that our importances are completely wrong relative to the truth. Why is this? To investigate, let’s look at the correlation of our variables:

library(corrplot) 
correlator  <-  function(df){
    df %>%
        keep(is.numeric) %>%
        tidyr::drop_na() %>%
        cor %>%
        corrplot("upper", addCoef.col = "white", number.digits = 2,
             number.cex = 0.8, method="circle",
             order="hclust", 
             tl.srt=45, tl.cex = 1)
}

X_test %>% correlator
Correlation Heatmap

Our variables are highly correlated! This is a massive problem for most interpretable ML techniques, especially those which rely on marginal distributions. There are multiple problems here. First, we have already mentioned the problem of unrealistic observations, such as boats which would be physically incapable of floating, or ones that are less buoyant than air. Another is the problem of averaging: we end up averaging the effects of the variables correlated with \(x_s\) over \(x_s\). If the correlations are strong and arranged just so, they can actually nullify the effect of the variable when we take this average. We will now look at two methods which are more robust to correlation!

Individual Conditional Expectation (ICE)

ICE is very simple to understand. If the averaging step is what causes trouble for PDP, why don’t we just get rid of it? Partial dependence, through averaging, hides potentially complex effects and interactions between variables. By taking out the averaging step, maybe we can learn a bit more about what’s going on!

As a corollary to this, this paper suggests we can also calculate a correlation-robust variable importance with ICE. To do so, we calculate the standard deviation of each ICE curve, and then average those. This should yield an accurate variable importance!
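Concretely, if we stack the ICE curves into a matrix with one curve per row, the importance is just the per-curve standard deviation, averaged. A toy sketch with made-up curves (not our models):

```python
import numpy as np

# three hypothetical ICE curves evaluated on a 5-point grid, one per row
curves = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],   # strongly varying curve
    [2.0, 2.0, 2.0, 2.0, 2.0],   # flat curve: this observation ignores the feature
    [0.0, 1.0, 0.0, 1.0, 0.0],   # wiggly curve
])

# standard deviation of each curve, then the average across curves
ice_importance = curves.std(axis=1).mean()
```

The flat curve contributes zero, so an important feature is one that moves many individual predictions, not just the average.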

Code Implementation

In Python

The code implementation of ICE is rather simple; it is exactly the same as the PDP code, only without the averaging! First we will calculate the ICE curves for a single variable:


def ICE_var(model, x, var):
    explan = sorted(x[var])
    preds = []
    for val in explan:
        X_tmp = x.copy()
        # hold the feature of interest fixed at this grid value
        X_tmp[var] = val
        preds.append(model.predict(X_tmp))
    # rows correspond to grid values, columns to observations
    preds = np.asarray(preds).reshape(len(explan), len(x))
    return(explan, preds)

Next, let’s go ahead and scale that up for an entire data frame and plot those ICE curves.


def plotICE(ex, values, ax = None):
    # each column of values holds one observation's ICE curve
    for i in range(values.shape[1]):
        if (ax is None):
            plt.plot(ex, values[:, i], 'black', alpha = 0.1)
        else:
            ax.plot(ex, values[:, i], 'black', alpha = 0.1)


fig = plt.figure(figsize = (10,10))
for i in range(len(X_test.columns)):
    iceX, iceY = ICE_var(rf, X_test, X_test.columns[i])
    ax = fig.add_subplot(2,3, i+1)
    plotICE(iceX, iceY, ax = ax)
    ax.set_title(X_test.columns[i])

plt.show()

So, what do these plots tell us? With so many lines it is hard to pull precise numbers out, but we can learn two useful things. The first thing to look at is the range from the peaks of the highest curves to the troughs of the lowest curves; this gives us an idea of the strength of the variable. The next thing to look out for is big splits in the curves, such as those seen in cabins and tonnage. These are indicative of heterogeneous effects (interactions between correlated variables), and they also show us why PDP fails: when we average curves with large splits, we hide that effect and cancel out much of it. For example, cabins shows a large, roughly symmetric split; averaging those curves leaves a fairly flat curve with a very small range. This is why partial dependence (or any tool which relies on averaging predictions over correlated variables) fails when effects are this complicated and correlated. It is also worth looking back at the correlation plot and noting the lack of correlation in passengers and passenger density; this is evident in the ICE curves as well! We can now look at the variable importance using ICE, and see whether it is improved or not.

def ICE_vips(model, x):
    imps = {}
    for i in range(len(x.columns)):
        icex, icey = ICE_var(model, x, x.columns[i])
        imps[x.columns[i]] = (np.std(icey, 0)).mean()
    return(imps)


ice_imps = {get_name(m): ICE_vips(m, X) for m in models}

def plot_ICE_imp(d, ax = None, color = "blue"):
    df = pd.DataFrame()
    df["Variable"] = d.keys()
    df["Importance"] = d.values()
    df.sort_values("Importance").plot(kind = "barh", x = "Variable", y = "Importance", width = 0.8, ax = ax, color = color)

fig = plt.figure(figsize = (10,10))
for i in range(len(ice_imps.keys())):
    ax = fig.add_subplot(len(list(ice_imps.keys())),1, i+1)
    #c = cm.Paired(i/len(imps.keys()), 1)
    c = sns.color_palette("hls", i+1)[i]
    plot_ICE_imp(ice_imps[list(ice_imps.keys())[i]], ax = ax, color = c)
    ax.set_title(list(ice_imps.keys())[i])
plt.show()

Look at that! The relative importances match up with those calculated both by the random forest and by the linear model, even though we have correlated variables!

In R

The code implementation of ICE is rather simple; it is exactly the same as the PDP code, only without the averaging! First we will calculate the ICE curves for a single variable:

ice_var <- function(model, x, var){
  x_sorted <- x[order(x[[var]], decreasing = TRUE),]
  # predefine matrix!
  out <- matrix(nrow = nrow(x), ncol = nrow(x))
  for (i in 1:nrow(x)) {
    tmp_df <- x_sorted
    tmp_df[[var]] <- tmp_df[[var]][i]
    out[i,] <- predict(model, tmp_df)
  }
  return(data.frame("value" = x_sorted[[var]], "ICE" = t(out)))
}

Sweet, now let’s go ahead and scale this up to run over an entire data frame:

ice_df <- function(model, df) {
  out <- lapply(colnames(df), function(x) {
    res <- ice_var(model, df, x)
    res$variable <- x
    return(res)
  })
  return(out)
}


# for reference
(rf_ice <- do.call(rbind,ice_df(trained_models[[3]], X_test)))

Next, let’s plot those ice curves! This should be pretty simple with a for loop, facet_wrap, and aes_string:

plot_ice <- function(df) {
  p <- ggplot(df)
  ice_cols <- names(df)[-c(1, ncol(df))]

  plot_single_curve_set <- function(col) {
    # superassignment: modify the out-of-scope plot object p
    p <<- p + geom_line(aes_string(x = "value", y = col), color = "black", alpha = 0.6) + 
      facet_wrap(variable ~ ., scales = "free")
  }

  # iterate through columns
  for (i in 1:length(ice_cols)) {
    plot_single_curve_set(ice_cols[i])
  }

  return(p + labs(y = "Prediction", x = "Value") + theme_fivethirtyeight())
}

plot_ice(rf_ice)
plot_ice(rf_ice)
Random Forest ICE curves

So, what do these plots tell us? With so many lines, it is hard to glean any truly precise numbers, but we can read two useful things from them. The first thing to look at is the range from the peaks of the highest curves to the troughs of the lowest curves. This gives us an idea of the strength of the variable. The next thing to look out for is big splits in the curves, such as those seen in cabins and Tonnage. These are indicative of heterogeneous effects (interactions between correlated variables), and they also show us why PDP fails. If we average curves with large splits, we hide that effect and cancel much of it out. For example, the cabins curve has a large, symmetric split; when we average these curves, we end up with a fairly flat curve with a very small range. This is why, when we have these complicated, correlated effects, partial dependence (or any tool which relies on averages of predictions over correlated variables) fails. It is also important to look back at the correlation plot and note the lack of correlation in passengers and passenger density, which is evident in the ICE curves as well. We can now compute variable importance using ICE, and see whether it is improved or not.

# pretty ugly chain of apply's to be honest, open to suggested improvements
ice_imp <- function(xs) {
  # get the standard deviation of each ICE curve and then average them, for each variable
  out <- as.data.frame(lapply(xs, function(x) {mean(apply((x[,-c(1, ncol(x))]), 1,sd))}))
  colnames(out) <- sapply(xs, function(x) x$variable[1])
  return((out))
}

# scale up for entire set of models
ices <- lapply(trained_models, function(x) ice_df(x, X_test))
ice_imps <- lapply(ices, ice_imp)

# tidy and plot!
ice_imp_df <- do.call(rbind, ice_imps)

# no need for superassignment!
ice_imp_df$model <- rownames(ice_imp_df)

ice_imp_df %>% gather(key = "key", value = "value", -model) %>%
  ggplot(aes(x = key, y = value, fill = model)) +
  geom_bar(stat = "identity") + 
  facet_grid(model ~ . , scales = "free") +
  coord_flip() + 
  theme_fivethirtyeight() +
  scale_fill_fivethirtyeight() +
  ggtitle("ICE Variable Importance!")
ICE Importances!

Look at that! The relative importances match up with those calculated both for the random forest and for the linear model, even though we have correlated values!

Discussion

ICE is impressive, and pretty intuitive. However, the combination of ICE and PDP still has one major issue: the inclusion of unrealistic values, such as our unfloating boats. We will next look at a solution to this (and to the rest of the problems we have been facing)!

Accumulated Local Effects (ALE)

Accumulated Local Effects, otherwise known as ALE, is more or less the most state-of-the-art, fully functional tool we have to understand our models. ALE has few practical drawbacks, but there is one big one: it is incredibly challenging to implement in practice. There are only three successful implementations of ALE (to my knowledge): an R package by the author of ALE, an R package by the author of a brilliant IML book, and a work-in-progress ALE python package. Therefore, for the purposes of this blog post, we are not going to go in on the implementation of ALE (this could make up a future blog post, or a series of them); instead, we will focus on understanding it in depth, and on using it. If you are a strong python programmer and would like to work on writing a complete ALE package, tweet me or email me! The code shown here should give you a strong idea of how to formulate ALE on a real dataset.

Understanding ALE

To understand ALE, we must first understand the deficiencies of our other methods. The first is that we use the marginal distribution of our \(x_c\) and \(x_s\) sets, meaning we completely ignore any effects of correlation in our calculations, assuming they are not present. To see what we are dealing with, let us make two correlated variables!

n <- 200
r <- 0.9
df <- data.frame(MASS::mvrnorm(n=n, mu = c(0,0), Sigma = matrix(c(1,r,r,1), nrow = 2), empirical = TRUE))
names(df) <- c("x", "y")
# highly correlated data
cor(df)
#>     x   y
#> x 1.0 0.9
#> y 0.9 1.0
p <-  ggplot() + geom_point(data = df, aes(x=x, y=y))
p

Before we get into the details, let us first take a look at the ALE algorithm. We will get into each step and try to understand the intuition later:

  1. Break the data (\((x_s, x_c)\)) into \(k\) neighborhoods, bounded by \(z_{k-1, s}\) and \(z_{k, s}\)
  2. For each \(x_c\) in the neighborhood, calculate the prediction at the lower bound and the upper bound, \(\hat f (z_{k-1, s}, x_c)\) and \(\hat f (z_{k, s}, x_c)\)
  3. Difference those predictions. This is known as the “local effect”.
  4. Average the local effects in the neighborhood
  5. Accumulate (sum) the average local effects across neighborhoods
  6. Center the accumulated local effects on zero by subtracting the mean, for readability.

Step One: Break it up!

The first step of ALE is rather intuitive. We want to be dealing in conditional distributions, rather than marginal ones, so we must break up our data into neighborhoods:

# We are following the code example from the IML book, by Christoph Molnar
# He is much smarter than I and the plots in this section should mostly be credited to him!
p <-  ggplot(df) + geom_point(aes(x = x, y = y)) +
  theme(panel.grid.major.y = element_blank(), 
    panel.grid.minor.y = element_blank(), 
    panel.grid.major.x = element_blank(), 
    panel.grid.minor.x = element_blank()) 

grid <- data.frame(x = quantile(-3:3, prob = seq(from=0, to = 1, length.out = 6)))
labels <- data.frame(x = grid[1:5,])
labels$xlab <- labels$x + .5
labels$ylab <- 2.95
labels$labels <- sapply(1:5, function(x) paste0("N(", x, ")"))

breaks <- c(expression(z[0~","~x]),  expression(z[1~","~x]), expression(z[2~","~x]), expression(z[3~","~x]),
  expression(z[4~","~x]), expression(z[5~","~x]))

p2 <- p + geom_vline(data = grid, aes(xintercept = x)) + 
      scale_x_continuous(limits = c(-3, 3), labels = breaks, breaks = seq(from = -3, to = 3, length.out = 6)) +
      geom_label(data = labels, aes(x = xlab, y = ylab, label = labels))
p2

So again, we are interested in the effect of \(x\), so we break the data into \(k\) neighborhoods, bounded by values of \(x\). We define the upper bound as \(z_{k, x}\) and the lower bound as \(z_{k-1, x}\), where \(k\) is the number of the neighborhood. So in this case, \(x\) is our interest set, \(x_s\), and \(y\) is our complement, \(x_c\).

Step 2: Calculate Local Effects!

This is the really interesting and satisfying part. In PDP, we were calculating the average prediction across variables, which allows their effect to mix in with the effect of \(x_s\). With ALE, we are finding a clever way to remove the influence of \(x_c\). How are we going to do this?

We are going to hold \(x_c\) constant and calculate the change in prediction across \(x_s\), cancelling out the effect of \(x_c\). That is:

  • For each instance of \(x_c\) in a neighborhood:
    • push \(x_s\) to \(z_{k-1, s}\), and calculate the lower prediction, \(\hat f (z_{k-1, s}, x_c)\)
    • push \(x_s\) to \(z_{k, s}\), and calculate the upper prediction, \(\hat f (z_{k, s}, x_c)\)
    • calculate the difference, \(\Delta \hat f = \hat f (z_{k, s}, x_c) - \hat f (z_{k-1, s}, x_c)\)

As the number of neighborhoods approaches infinity (so each neighborhood shrinks), the difference per unit change in \(x_s\) approaches the change in prediction with respect to \(x_s\), that is:

\[\widehat {\mathrm{Local\ Effect}}(x_s) = \frac{\partial \hat f (x_s,x_c)} {\partial x_s} = \lim_{k \to \infty} \frac{\Delta \hat f}{z_{k, s} - z_{k-1, s}}\]
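To see this limit concretely, here is a tiny Python check using a hypothetical model (not one of the models from this post): for \(\hat f(x_s, x_c) = x_s^2 + x_s x_c\), the differenced prediction over a shrinking interval, divided by the interval width, approaches \(\partial \hat f / \partial x_s = 2x_s + x_c\).

```python
# Hypothetical model, purely to illustrate the limit above:
# f(xs, xc) = xs^2 + xs*xc, so df/dxs = 2*xs + xc.
def f(xs, xc):
    return xs ** 2 + xs * xc

xs, xc = 1.0, 0.5          # evaluate the local effect at this point
for width in (1.0, 0.1, 0.001):
    # difference of predictions at the neighborhood bounds, per unit width
    local_effect = (f(xs + width, xc) - f(xs, xc)) / width
    print(width, local_effect)
# as the neighborhoods shrink, local_effect approaches 2*1.0 + 0.5 = 2.5
```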

Or visually what we are doing:

ind <- cut(df$x, grid$x, include.lowest = T)

data_grouped <- split(df, ind)
#> names(data_grouped)
#> [1] "[-3,-1.8]"   "(-1.8,-0.6]" "(-0.6,0.6]"  "(0.6,1.8]"   "(1.8,3]"    
data_grouped[[1]]$upper <- -1.8
data_grouped[[1]]$lower <- -3
data_grouped[[2]]$upper <- -0.6
data_grouped[[2]]$lower <- -1.8
data_grouped[[3]]$upper <- 0.6
data_grouped[[3]]$lower <- -0.6
data_grouped[[4]]$upper <- 1.8
data_grouped[[4]]$lower <- 0.6
data_grouped[[5]]$upper <- 3
data_grouped[[5]]$lower <- 1.8

data_grouped <- do.call(rbind, data_grouped)
#> names(data_grouped)
#> [1] "x"     "y"     "upper" "lower"
data_grouped <- rename(data_grouped, initial = x)
data_grouped <- gather(data_grouped, key = "state", value = "loc", -y)


ggplot(data_grouped) + geom_point(aes(x = loc, y = y)) + 
      geom_vline(data = grid, aes(xintercept = x)) + 
      scale_x_continuous(limits = c(-3, 3), labels = breaks, breaks = seq(from = -3, to = 3, length.out = 6)) +
      geom_label(data = labels, aes(x = xlab, y = ylab, label = labels)) + 
      transition_states(state, transition_length = 1, state_length = 2)

So again, we are moving the entire neighborhood to the lower bound, making a prediction, moving the entire neighborhood to the upper bound, making a prediction, differencing those, and averaging those differences. We are averaging them weighted by the conditional distribution of \(x_c\) in this case, or more formally: \[ E \left[ \widehat {\mathrm{local\ effects}}(x_s, x_c)\mid x_s = z_s \right] = \int p_{c \mid s}(x_c\mid z_{s})\,\widehat {\mathrm{local\ effects}}(z_{s}, x_c)\,dx_c \]

Step 3: Accumulate and Center

We have made it out of the hairy part, understanding wise. All that is left now is to sum the local effects through each neighborhood, which can be mathematically written as just an integral over the neighborhoods:

\[ \mathrm{ALE}(x_s) = \int_{z_{0,s}}^{x_s} E \left[ \widehat {\mathrm{local\ effects}}(x_s, x_c)\mid x_s = z_s \right] dz_s - D \]

Where \(D\) is some constant which centers the ALE on 0 (typically the mean).

Estimating ALE

We must now sadly leave the infinite world of integrals and fun and enter the real world of sums and estimation. We will now talk mathematically about estimating ALE, which is, again, not too hard:

First, we calculate the local effect for each instance \(i\) in the neighborhood:

\[\Delta f = \left[ \hat f(z_{k,j}, x^{(i)}_{\setminus j}) - \hat f(z_{k-1,j}, x^{(i)}_{\setminus j}) \right]\]

Next, we average those through the neighborhood (note the notation change, just as we did for PDP):

\[ \frac{1}{n_j(k)} \sum_{i:x_j^{(i)} \in N_j(k)}\left[ \hat f(z_{k,j}, x^{(i)}_{\setminus j}) - \hat f(z_{k-1,j}, x^{(i)}_{\setminus j}) \right] \]

Then we accumulate (sum) those throughout neighborhoods:

\[ {\widetilde {\mathrm{ALE}}}(x_{j}) = \sum_{k=1}^{k_j(x)} \frac{1}{n_j(k)} \sum_{i:x_j^{(i)} \in N_j(k)}\left[ \hat f(z_{k,j}, x^{(i)}_{\setminus j}) - \hat f(z_{k-1,j}, x^{(i)}_{\setminus j}) \right] \]

Finally, we subtract the mean to center the ALE on zero!

\[ \mathrm{ALE}(x) ={\widetilde {\mathrm{ALE}}}(x_j)- \frac{1}{n} \sum_{i=1}^n {\widetilde {\mathrm{ALE}}}(x_j^{(i)}) \]
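Putting the four estimation formulas together, a minimal first-order ALE estimator can be sketched in a few lines of numpy. This is an illustrative sketch, not one of the packages discussed below; the quantile-based bin edges and the count-weighted centering are standard choices (assumptions on my part), and `model` is assumed to be anything with a `predict` method:

```python
import numpy as np
import pandas as pd

def ale_1d(model, X, feature, k=10):
    """Minimal first-order ALE sketch: local effects per quantile
    neighborhood, averaged, accumulated, then centered on zero."""
    x = X[feature].values
    # 1. quantile bin edges z_0 .. z_k define the neighborhoods
    z = np.unique(np.quantile(x, np.linspace(0, 1, k + 1)))
    # assign each row to a neighborhood (0 .. len(z) - 2)
    bins = np.clip(np.digitize(x, z[1:-1]), 0, len(z) - 2)
    avg_effects = np.zeros(len(z) - 1)
    for b in range(len(z) - 1):
        idx = bins == b
        if not idx.any():
            continue
        lo, hi = X[idx].copy(), X[idx].copy()
        # 2. predict at the lower and upper bound of the neighborhood
        lo[feature], hi[feature] = z[b], z[b + 1]
        # 3./4. difference the predictions and average within the bin
        avg_effects[b] = (model.predict(hi) - model.predict(lo)).mean()
    # 5. accumulate across neighborhoods
    ale = np.concatenate([[0.0], np.cumsum(avg_effects)])
    # 6. center on zero: subtract the (approximate, count-weighted) mean ALE
    counts = np.bincount(bins, minlength=len(z) - 1)
    centered = ale - np.sum(0.5 * (ale[1:] + ale[:-1]) * counts) / counts.sum()
    return z, centered
```

Checking it on a hypothetical linear model \(f = 2x + y\) with strongly correlated inputs, the estimated ALE of \(x\) has slope 2, untroubled by the correlation.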

ALE in practice

As I mentioned before, we are not going to implement ALE ourselves in this already very long blog post; instead, we will look at the basic usage of ALE, its interpretation, and what I think are some clever uses.

First, let's load in the necessary library:

library(iml)

forest <- trained_models[[3]]

Next, we are going to simply calculate ALE!

model <- Predictor$new(forest, data = X, y = y$crew)

# calculate effects!
effect <- FeatureEffects$new(model)

# its that easy!
plot(effect)

So, how do we read this? It is very similar to the partial dependence plots, except we think in terms of the difference from the average prediction. So, if we look at the passengers plot, we are saying that when passengers = 20, the prediction will be 0.5 higher than average! It is pretty easy to interpret, but it is very powerful. Again, the nice thing about ALE is that we know these variables are in complete isolation from each other, with no sneaky interactions coming in or out. If you are interested in ALE for categorical variables, which is possible, I highly suggest reading the original paper.

We can also do ALE in 2 dimensions. What does this show us? This shows, amazingly, the effect of purely the interaction between two variables! This is incredibly powerful. Let's check it out!

# note we are using Effect not Effects this time
ALE <- FeatureEffect$new(model, c("Tonnage","cabins"))
plot(ALE)

Well, what does this tell us? This tells us many things! First, we have direct evidence that our random forest is modeling an interaction between Tonnage and cabins, and second, we know exactly what the effect of that interaction is. We can even use these effects to view how a single prediction is built, which is, in my opinion, remarkable. Before I end this blog post, I will show how to do this:

Anatomy of a Prediction

To understand how our prediction is actually made, we first want to pick out an instance of data as our test case. We will say that this test case represents the wonderful ship, the S.S. David:

set.seed(33)
(david <- X[sample(1:nrow(X), 1),])
# also grab the predictions, to get the average prediction
preds <- predict(forest, X)
pmean <- mean(preds)

Next, we can hack together some ugly code to get the ALE values out of the effect object. Note, I am not proud of this code, but it certainly works!

# get the results from the effect object for a single feature:

getALE <- function(feat, val) {
  out <- effect$results[[feat]][1:3] %>%  data.frame
  outdt <- out$.feature %>% data.frame
  names(outdt) = "x"
  outdt$val <- val
  mindist <- abs(outdt$x - outdt$val)
  # find the closest effect (it is a grid after all)
  nn <- outdt$x[which.min(mindist)]
  res <- effect$results[[feat]][1:3] %>% 
    filter(.feature == nn) %>% 
    select(.ale) 
  return(as.numeric(res))
}

Next, we want to scale this up over all variables (this is the way I approach a lot of programming problems, if that's not apparent).

generalEffect <- function(boat) {
  nms <- names(boat)
  ales <- lapply(nms, function(x) getALE(x, boat[[x]]))
  return(ales)
}

# ges for generaleffects
ges <- generalEffect(david)
names(ges) <- names(david)

Next, let's also get the interaction between cabins and Tonnage, for fun! Getting all interactions is left as an exercise for the reader ;)

getInteractALE <- function(boat) {
  t <- boat$Tonnage 
  c <- boat$cabins
  tdist <-  abs(ALE$results$Tonnage - t) 
  cdist <-  abs(ALE$results$cabins - c) 
  tnn <- ALE$results$Tonnage[which.min(tdist)]
  cnn <- ALE$results$cabins[which.min(cdist)]
  res <- ALE$results %>%
    filter(Tonnage == tnn & cabins == cnn) %>% select(.ale)
    return(as.numeric(res))
}

ges$interact <- getInteractALE(david)
(ges <- as.data.frame(ges))

Next, let's build the prediction! This is pretty simple: we are going to put the mean in one column, then cumulatively add in the effects of the other columns. This is best done with a for loop.

ges <- cbind("mean" = pmean, ges)

for (i in 2:ncol(ges)) {
  ges[i] <- ges[i] + ges[i-1] 
}

ge_gather <- gather(ges, key = "variable", value = "prediction") 
# for our lovely animation
ge_gather$idx <- 1:nrow(ge_gather)

ge_gather$variable <- factor(ge_gather$variable, levels = ge_gather$variable)
ge_gather %>% ggplot() + geom_point(aes(x =variable, y = prediction, color = variable), size = 5) +coord_flip() + scale_color_few() + ggtitle("Prediction") + theme_hc() +  transition_reveal(idx) 
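For reference, the cumulative build in the for loop above is a one-liner in Python with numpy's cumsum (the numbers below are made up for illustration, not the S.S. David's actual effects):

```python
import numpy as np

# hypothetical mean prediction and per-feature ALE contributions
pmean = 8.0
effects = {"Tonnage": 1.2, "cabins": -0.4, "passengers": 0.3}

# running prediction after each effect is added, as in the R for loop
running = np.cumsum([pmean] + list(effects.values()))
print(dict(zip(["mean"] + list(effects), running.round(2))))
# the last entry, 9.1, is the reconstructed prediction
```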

And there we have it! We have actually built and looked at the mechanics of our prediction. The amazing part about these tools is that we can take essentially any model that makes a prediction, in any language, look inside, and understand what's going on. Please note, because I did not include the other interaction effects, the final value of our ALE prediction is actually slightly off, but we can see that, in general, it is correct. This plot was inspired by the break-down plots discussed here. For a bit more on ALE, I would recommend reading the original paper, the IML book referenced below, and this chapter (or the whole book).

What About SHAP and LIME?

SHAP and LIME are two other well-known IML tools that are used often. However, a paper has recently come out that calls them into question, for the same reason we questioned permutation importance, partial dependence, and ICE: unrealistic observations. The paper went so far as to train a deliberately biased model and show how it manages to fool both LIME and SHAP. Until this is resolved, I think our best bet is to use ALE in general, but we have only scratched the surface of IML tools in this blog post. This leads us to our next topic: resources.

Software Resources for IML

  • drWhy.AI
    • A collection of AMAZING R-based tools for interpretable ML. Note that these tools even work out of the box with sklearn and Keras. Highly recommended!
  • iml R package
    • Terrific R package for interpretable machine learning. Highly recommended.
  • ALEpython
    • Python package for ALE, contribute if you can!!
  • yellowbrick
    • Amazing python tool for model visualizations
  • ELI5
    • Permutation Importance and LIME in python. Excellent tutorials

Books and papers which were valuable to me